class: center, middle, inverse, title-slide

# Tidy data
## Data Science & Statistics
### Gavin McNicol
### 2021-09-18

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way.
>
>Leo Tolstoy

--

.pull-left[
**Characteristics of tidy data:**

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]

--

.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

##

.question[
What makes this data not tidy?
]

<img src="data:image/png;base64,#img/hyperwar-airplanes-on-hand.png" width="90%" style="display: block; margin: auto;" />

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

.question[
What makes this data not tidy?
]

<br>

<img src="data:image/png;base64,#img/hiv-est-prevalence-15-49.png" width="100%" style="display: block; margin: auto;" />

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

.question[
What makes this data not tidy?
]

<br>

<img src="data:image/png;base64,#img/us-general-economic-characteristic-acs-2017.png" width="100%" style="display: block; margin: auto;" />

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)
]

---

## Displaying vs. summarizing data

### Displaying data

.panelset[
.panel[.panel-name[Output]
.pull-left[

```
## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```
]
]
.panel[.panel-name[Code]
.pull-right[

```r
starwars %>%
  select(name, height, mass)
```
]
]
]

---

## Displaying vs. summarizing data

### Summarizing data

.panelset[
.panel[.panel-name[Output]
.pull-left[

```
## # A tibble: 3 × 2
##   gender    avg_ht
##   <chr>      <dbl>
## 1 feminine    165.
## 2 masculine   177.
## 3 <NA>        181.
```
]
]
.panel[.panel-name[Code]
.pull-right[

```r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_ht = mean(height, na.rm = TRUE)
  )
```
]
]
]

---

.center[
.large[
This class content was built from the Data Science in a Box source materials.

https://datasciencebox.org/index.html
]
]
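
---

## Tidy data

### A sketch: from untidy to tidy

One common way a table breaks the "each variable forms a column" rule is by spreading a variable (such as year) across column headers. The sketch below shows one possible fix using `tidyr::pivot_longer()`. The table `wide`, its column names, and all of its values are made up for illustration; they are not taken from any of the sources shown earlier.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one column per year.
# Countries and values are invented for illustration only.
wide <- tribble(
  ~country, ~`2009`, ~`2010`, ~`2011`,
  "A",          0.1,     0.2,     0.3,
  "B",          1.0,     1.1,     1.2
)

# Move the year columns into a single `year` variable,
# so that each row is one country-year observation.
tidy <- pivot_longer(
  wide,
  cols = -country,                            # every column except `country`
  names_to = "year",                          # old column names become values of `year`
  values_to = "prevalence",                   # old cell values go into `prevalence`
  names_transform = list(year = as.integer)   # store years as integers, not strings
)

tidy
```

After reshaping, `year` is an ordinary column, so it can be used with `group_by()` and `summarize()` in the same way as `gender` in the summarizing example.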